Before we begin, you will need to install these packages:
Now we load a few R packages:
Today we are going to talk about getting data, examples of common data formats, and useful tools to access data.
First let’s have a bit of a philosophical discussion about data.
As data analysts, this is what we wish data looked like whenever we start a project:
However, the reality is that data is rarely in that form; it comes in all types of “raw” formats that need to be transformed into a “clean” format.
For example, in the field of genomics, raw data looks something like this:
Or if you are interested in analyzing data from Twitter:
Or data from Electronic Healthcare Records (EHRs):
We all have our scary spreadsheet tales. Here is Jenny Bryan from RStudio and UBC asking for some of those spreadsheet tales on Twitter.
For example, this is an actual spreadsheet from Enron in 2001:
From https://simplystatistics.org/2016/07/20/relativity-raw-data/ raw data is defined as data…
…if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the rawest version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.
Data lives anywhere and everywhere. Data might be stored simply in a .csv or .txt file. Data might be stored in an Excel or Google Spreadsheet. Data might be stored in large databases that require users to write special functions to interact with to extract the data they are interested in.
For example, you may have heard of the terms mySQL or MongoDB.
From Wikipedia, MySQL is defined as “an open-source relational database management system (RDBMS). Its name is a combination of ‘My’, the name of co-founder Michael Widenius’s daughter, and ‘SQL’, the abbreviation for Structured Query Language.”
From Wikipedia, MongoDB is defined as “a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemata.”
So after reading that, we get the sense that there are multiple ways large databases can be structured, data can be formatted and interacted with. In addition, we see that database programs (e.g. MySQL and MongoDB) can also interact with each other.
We will learn more about JSON in a bit.
A great article in PeerJ was written titled How to share data for collaboration, in which the authors describe a set of guidelines for sharing data:
We highlight the need to provide raw data to the statistician, the importance of consistent formatting, and the necessity of including all essential experimental information and pre-processing steps carried out to the statistician. With these guidelines we hope to avoid errors and delays in data analysis.
It’s a great paper that describes the information you should pass to a statistician to facilitate the most efficient and timely analysis. Specifically:
1. The raw data.
2. A tidy data set.
3. A code book describing each variable and its values in the tidy data set.
4. An explicit and exact recipe you used to go from 1 to 2 and 3.

First let’s talk about a few important things before we download any data.
When you are starting a data analysis, you have already learned about the use of .Rproj files. When you open up a .Rproj file, RStudio changes the path (location on your computer) to the .Rproj location.
After opening up a .Rproj file, you can test this by running getwd(), which prints the current working directory.
When you open up someone else’s R code or analysis, you might also see the setwd() function being used, which explicitly tells R to change to an absolute path (the exact location of a directory on a particular computer).
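As a minimal sketch of how these two functions behave (using a temporary directory so the example is harmless to run):

```r
# getwd() reports the current working directory;
# setwd() changes it (here to R's session-specific temp directory)
old <- getwd()
setwd(tempdir())
getwd()        # now the temp directory
setwd(old)     # restore the original working directory
```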
For example, say I want to clone a GitHub repo from Roger, which has 100 R script files, and at the top of every one of those files is a line like `setwd("C:\Users\Roger\path\only\that\Roger\has")`.
The problem is, if I want to use his code, I will need to go and hand-edit every single one of those paths (C:\Users\Roger\path\only\that\Roger\has) to the path that I want to use on my computer or wherever I saved the folder on my computer (e.g. /Users/Stephanie/Documents/path/only/I/have).
So instead of absolute paths:
A better idea is to use relative paths:
An even better idea is to use the here R package, which recognizes the top-level directory of a Git repo and supports building all paths relative to it. For more on project-oriented workflow suggestions, read this post from Jenny Bryan.
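To make the contrast concrete, here is a minimal sketch (the paths and file names are hypothetical):

```r
# Absolute path: breaks on any computer other than the author's
# setwd("C:/Users/Roger/path/only/that/Roger/has")

# Relative path: resolved against the current working directory,
# which RStudio sets to the .Rproj folder
rel_path <- file.path("data", "cameras.csv")
rel_path
#> [1] "data/cameras.csv"

# here::here() (if the here package is installed) builds the same
# path starting from the top-level directory of the project/Git repo:
# here::here("data", "cameras.csv")
```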
If you want to download a file, a few useful functions are file.exists(), dir.create() and list.files().
- file.exists("my/relative/path"): logical test of whether the file exists
- dir.create("my/relative/path"): create a folder
- list.files("my/relative/path"): list the contents of a folder

Let’s say we wanted to find out where all the fixed speed cameras in Baltimore are.
To do this, we can use the Open Baltimore API which has information on the locations of fixed speed cameras in Baltimore.
In case you aren’t familiar with fixed speed cameras, the website states:
Motorists who drive aggressively and exceed the posted speed limit by at least 12 miles per hour will receive $40 citations in the mail. These citations are not reported to insurance companies and no license points are assigned. Notification signs will be placed at all speed enforcement locations so that motorists will be aware that they are approaching a speed check zone. The goal of the program is to make the streets of Baltimore safer for everyone by changing aggressive driving behavior. In addition to the eight portable speed enforcement units, the city has retrofitted 50 red light camera locations with the automated speed enforcement technology.
When we go to the website, we see that the data can be provided to us as a .csv file. To download this data, we can do the following:
file_url <- paste0("https://data.baltimorecity.gov/api/",
"views/dz54-2aru/rows.csv?accessType=DOWNLOAD")
download.file(file_url,
destfile="../data/cameras.csv")
list.files("../data/.")

Alternatively, if we want to download the file only once each time we knit our reproducible report or homework or project, we can wrap the code above in an if statement using !file.exists():
if(!file.exists("../data/cameras.csv")){
file_url <- paste0("https://data.baltimorecity.gov/api/",
"views/dz54-2aru/rows.csv?accessType=DOWNLOAD")
download.file(file_url,
destfile = "../data/cameras.csv")
}
list.files("../data/.")

## [1] "cameras.csv" "KFF"         "NMMAPS"
From there, we can read in the cameras.csv like we have already learned how to do using the readr::read_csv() function:
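The code chunk that produced the output below was stripped from this copy; it presumably looked like `cameras <- readr::read_csv("../data/cameras.csv")`. As a self-contained sketch, note that read_csv() can also parse literal text wrapped in I(), which is handy for quick experiments (the toy string below is made up to mimic two columns of the cameras data):

```r
library(readr)

# read_csv() accepts a file path, a URL, or literal text via I();
# this toy string mimics two columns of the cameras data
demo <- read_csv(I("address,direction\nS CATON AVE,N/B"))
demo
```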
## Parsed with column specification:
## cols(
## address = col_character(),
## direction = col_character(),
## street = col_character(),
## crossStreet = col_character(),
## intersection = col_character(),
## `Location 1` = col_character()
## )
## # A tibble: 80 x 6
## address direction street crossStreet intersection `Location 1`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 S CATON AV… N/B Caton … Benson Ave Caton Ave & B… (39.269377996…
## 2 S CATON AV… S/B Caton … Benson Ave Caton Ave & B… (39.269315789…
## 3 WILKENS AV… E/B Wilken… Pine Heigh… Wilkens Ave &… (39.272025230…
## 4 THE ALAMED… S/B The Al… 33rd St The Alameda … (39.328501314…
## 5 E 33RD ST … E/B E 33rd The Alameda E 33rd & The… (39.328341062…
## 6 ERDMAN AVE… E/B Erdman Macon St Erdman & Mac… (39.306804567…
## 7 ERDMAN AVE… W/B Erdman Macon St Erdman & Mac… (39.306966535…
## 8 N CHARLES … S/B Charles Lake Ave Charles & Lak… (39.369053529…
## 9 E MADISON … W/B Madison Caroline St Madison & Ca… (39.299325766…
## 10 ORLEANS ST… E/B Orleans Linwood Ave Orleans & L… (39.295866198…
## # ... with 70 more rows
jsonlite

JSON (or JavaScript Object Notation) is a file format that stores information in a human-readable, organized, logical, easy-to-access manner.
For example, here is what a JSON file looks like:
var stephanie = {
"age" : "33",
"hometown" : "Baltimore, MD",
"gender" : "female",
"cars" : {
"car1" : "Hyundai Elantra"
"car2" : "Toyota Rav4"
"car3" : "Honda CR-V"
}
}

Some features of a JSON object:
- Data is stored in name/value pairs (e.g. "age" : "33")
- Name/value pairs are separated by commas
- Curly braces {} hold objects
- Square brackets [] hold arrays

Let’s say we want to use the GitHub API to find out how many of my GitHub repositories have open issues.
We will use the jsonlite R package and the fromJSON() function to convert from a JSON object to a data frame.
We will read in a JSON file located at https://api.github.com/users/stephaniehicks/repos
github_url = "https://api.github.com/users/stephaniehicks/repos"
library(jsonlite)
jsonData <- fromJSON(github_url)

The function fromJSON() has now converted the JSON file into a data frame with these names:
## [1] "id" "node_id" "name"
## [4] "full_name" "private" "owner"
## [7] "html_url" "description" "fork"
## [10] "url" "forks_url" "keys_url"
## [13] "collaborators_url" "teams_url" "hooks_url"
## [16] "issue_events_url" "events_url" "assignees_url"
## [19] "branches_url" "tags_url" "blobs_url"
## [22] "git_tags_url" "git_refs_url" "trees_url"
## [25] "statuses_url" "languages_url" "stargazers_url"
## [28] "contributors_url" "subscribers_url" "subscription_url"
## [31] "commits_url" "git_commits_url" "comments_url"
## [34] "issue_comment_url" "contents_url" "compare_url"
## [37] "merges_url" "archive_url" "downloads_url"
## [40] "issues_url" "pulls_url" "milestones_url"
## [43] "notifications_url" "labels_url" "releases_url"
## [46] "deployments_url" "created_at" "updated_at"
## [49] "pushed_at" "git_url" "ssh_url"
## [52] "clone_url" "svn_url" "homepage"
## [55] "size" "stargazers_count" "watchers_count"
## [58] "language" "has_issues" "has_projects"
## [61] "has_downloads" "has_wiki" "has_pages"
## [64] "forks_count" "mirror_url" "archived"
## [67] "open_issues_count" "license" "forks"
## [70] "open_issues" "watchers" "default_branch"
To find out how many repos I have with open issues, we can just create a table:
##
## 0 1
## 27 3
Whew! Not as many as I thought.
How many do you have?
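The call that produced the table above was stripped from this copy; presumably it tabulated the open-issues column, e.g. table(jsonData$open_issues). Here is a self-contained sketch of the same idea, using a literal JSON string in place of the GitHub response:

```r
library(jsonlite)

# a toy stand-in for the GitHub API response: a JSON array of repos
txt <- '[{"name": "repoA", "open_issues": 0},
         {"name": "repoB", "open_issues": 1},
         {"name": "repoC", "open_issues": 0}]'
repos <- fromJSON(txt)   # becomes a data frame

# tabulate how many repos have each number of open issues
table(repos$open_issues)
```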
Finally, I will leave you with a few other examples of using the GitHub API (e.g. exploring a specific repository, such as the dplyr package).

rvest

Do we want to purchase a book on Amazon?
Next we are going to learn about what to do if your data is on a website (XML or HTML) formatted to be read by humans instead of R.
We will use the (really powerful) rvest R package to do what is often called “scraping data from the web”.
Before we do that, we need to set up a few things:
We’re going to be scraping this page: it just contains the (first page of) reviews of the ggplot2 book by Hadley Wickham.
We use the rvest package, specifically the read_html() function, to download this page.
Now h is an xml_document that contains the contents of the page:
## {xml_document}
## <html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="a-m-us a-aui_149818-c a-aui_152852-c a-aui_157141-c a-a ...
How can you actually pull the interesting information out? That’s where CSS selectors come in.
CSS selectors are a way to specify a subset of nodes (that is, units of content) on a web page (e.g., just getting the titles of reviews). CSS selectors are very powerful and not too challenging to master; here’s a great tutorial. But honestly, you can get a lot done even with very little understanding, by using a tool called SelectorGadget.
Install the SelectorGadget on your web browser. (If you use Chrome you can use the Chrome extension, otherwise drag the provided link into your bookmarks bar). Here’s a guide for how to use it with rvest to “point-and-click” your way to a working selector.
For example, if you just wanted the titles, you’ll end up with a selector that looks something like .a-color-base. You can pipe your HTML object along with that selector into the html_nodes function, to select just those nodes:
## {xml_nodeset (12)}
## [1] <span class="a-size-small a-color-base reviews-sort-order-label a-t ...
## [2] <span class="a-size-small a-color-base reviews-filter-by-label a-te ...
## [3] <a data-hook="review-title" class="a-size-base a-link-normal review ...
## [4] <a data-hook="review-title" class="a-size-base a-link-normal review ...
## [5] <a data-hook="review-title" class="a-size-base a-link-normal review ...
## [6] <a data-hook="review-title" class="a-size-base a-link-normal review ...
## [7] <a data-hook="review-title" class="a-size-base a-link-normal review ...
## [8] <a data-hook="review-title" class="a-size-base a-link-normal review ...
## [9] <a data-hook="review-title" class="a-size-base a-link-normal review ...
## [10] <a data-hook="review-title" class="a-size-base a-link-normal review ...
## [11] <a data-hook="review-title" class="a-size-base a-link-normal review ...
## [12] <a data-hook="review-title" class="a-size-base a-link-normal review ...
But you need the text from each of these, not the full tags. Pipe to the html_text function to pull these out:
## [1] "Sort by"
## [2] "Filter by"
## [3] "Still a great package and highly worth learning - but the text is getting quite out of date."
## [4] "Must-have reference for R graphics"
## [5] "Excellent"
## [6] "Nice resource, but already out of date"
## [7] "The best guide to the best graphics (I think) out there."
## [8] "Graphing in R"
## [9] "Excellent content, poor adaptation to kindle"
## [10] "Excellent R resource for the Kindle"
## [11] "Great book, outdated"
## [12] "Indispensable resource for ggplot2 users"
Now we’ve extracted something useful! Similarly, let’s grab the format (hardcover or paperback). Some experimentation with SelectorGadget shows it’s:
## [1] "Format: Paperback" "Format: Paperback"
## [3] "Format: Kindle Edition" "Format: Paperback"
## [5] "Format: Paperback" "Format: Kindle Edition"
## [7] "Format: Kindle Edition" "Format: Kindle Edition"
## [9] "Format: Paperback" "Format: Paperback"
Now, we may be annoyed that each entry always starts with “Format: ”. Let’s introduce the stringr package.
library(stringr)

formats <- h %>%
html_nodes(".a-size-mini.a-color-secondary") %>%
html_text() %>%
str_replace("Format: ", "")
formats

## [1] "Paperback"      "Paperback"      "Kindle Edition" "Paperback"
## [5] "Paperback" "Kindle Edition" "Kindle Edition" "Kindle Edition"
## [9] "Paperback" "Paperback"
We could do a similar exercise to extract the number of stars and whether or not someone found a review useful. This would help us decide whether we were interested in purchasing the book!
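Putting the pieces together, here is a self-contained sketch of the whole rvest pipeline, run on a literal HTML snippet instead of the live Amazon page (the markup and class names below are made up for illustration):

```r
library(rvest)
library(stringr)

# made-up HTML standing in for a page of reviews
page <- '<html><body>
  <a data-hook="review-title">Great book</a>
  <span class="format">Format: Paperback</span>
  <a data-hook="review-title">Outdated but useful</a>
  <span class="format">Format: Kindle Edition</span>
</body></html>'

h <- read_html(page)

# titles: select nodes by attribute, then extract their text
titles <- h %>%
  html_nodes("a[data-hook='review-title']") %>%
  html_text()

# formats: select by class, extract text, strip the "Format: " prefix
formats <- h %>%
  html_nodes(".format") %>%
  html_text() %>%
  str_replace("Format: ", "")

titles
formats
```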
The Huffington Post has an API which provides US opinion poll data on various political races and other non-political opinion polls.
There is an R package called pollstR which provides an easy user interface.
For example, the API has data on the Trump Job Approval
Here we use the pollster_charts_polls() function:
## Parsed with column specification:
## cols(
## Approve = col_double(),
## Disapprove = col_double(),
## Undecided = col_double(),
## poll_slug = col_character(),
## survey_house = col_character(),
## start_date = col_date(format = ""),
## end_date = col_date(format = ""),
## question_text = col_character(),
## sample_subpopulation = col_character(),
## observations = col_integer(),
## margin_of_error = col_double(),
## mode = col_character(),
## partisanship = col_character(),
## partisan_affiliation = col_character()
## )
We can see what’s in the object:
## [1] "content" "url" "response"
The url links to the data itself
## [1] "https://elections.huffingtonpost.com/pollster/api/v2/charts/trump-job-approval/pollster-chart-poll-questions.tsv"
The content contains the polling data:
## # A tibble: 887 x 14
## Approve Disapprove Undecided poll_slug survey_house start_date
## <dbl> <dbl> <dbl> <chr> <chr> <date>
## 1 47 50 NA rasmussen-28863 Rasmussen 2018-09-20
## 2 40 56 NA gallup-28860 Gallup 2018-09-17
## 3 42 52 5 politico-morni… Politico/Morni… 2018-09-17
## 4 48 50 NA rasmussen-28855 Rasmussen 2018-09-17
## 5 44 52 3 fox-28866 FOX 2018-09-16
## 6 44 52 4 nbc-wsj-28865 NBC/WSJ 2018-09-16
## 7 44 54 2 surveymonkey-2… SurveyMonkey 2018-09-13
## 8 41 50 9 yougov-economi… YouGov/Economi… 2018-09-16
## 9 42 54 4 ipsos-reuters-… Ipsos/Reuters 2018-09-12
## 10 49 50 NA rasmussen-28848 Rasmussen 2018-09-12
## # ... with 877 more rows, and 8 more variables: end_date <date>,
## # question_text <chr>, sample_subpopulation <chr>, observations <int>,
## # margin_of_error <dbl>, mode <chr>, partisanship <chr>,
## # partisan_affiliation <chr>
We will learn more about polling data in Term 2 of this course.
- jsonlite for JSON (e.g. the GitHub API)
- rvest to grab all the exact elements you want (e.g. book reviews)
- httr for tools to work with URLs and HTTP
- googlesheets to interact with Google Sheets in R
- [googledrive](https://googledrive.tidyverse.org/) to interact with your Google Drive